Based on the following steps, we found that there is 421570 instances in this dataset. Among the dataset, there are total 45 retail Stores sales result in this dataset. In each store, there are about 99 Departments.
Before we get into the analysis, we first deal with the missing values. As we observe the following, the missing values are constantly stacking on Five features which are MarkDown1, MarkDown2, MarkDown3, MarkDown4, MarkDown5. Besides, missing values are not found among other features. For this reason, we are going to grub deeper into those features to check where those missing values are correlatable enough to be replaced.
The following is the basic proportion of missing values for every store. Based on the table 1, we could observe that the proportion of missing among these five features are ranging between (63%-90%). With these high volumes of missing values, we are going to grub more to check whether the values are randomly missing or not. Table 2 shows that the range of data without missing values from 2011-11-11 to 2012-10-26 and Table 3 shows that the range of data with missing values from 2010-02-05 to 2011-11-04. In this case, we could observe that the large volume of missing values are NOT randomly missed. In the consideration of the huge volumes as well as the missing patterns, I would propose to drop the five features (i.e. columns) instead of records (i.e. rows) because this is important to keep as many records as we could for the time-series analysis.
This dataset contains two categorical features which are "IsHoliday" and "Store_Type" variables. In order to proceed time-series analysis, these Cat attributes will be dummied into individual variables as Table 4. As you see, the binary "IsHoliday" and "Store_Type" features are divided into 2 columns (i.e. 1,0) and 3 columns (i.e. A,B,C)
Next, we are going to check out the outltiers on continuous variables against every stores.
Weekly_Sales: Due to geographical or holiday features, the Sales associated with every store should varies. Refer to Figure 1, the Weekly sales mot only varies quite a bit, a considerablely large amount data point are out of the range of typical range of sales. As mentioned, the variation might be caused by mixed factors. For this reason, I may just leave those "Outliers" as is for now.
Temperature: Refer to figure 2, the other variable "Temperature" looks very consistant across the stores. Also, there are NO outliers significantly out of their typical range among the stores.
Fuel Price: Refer to figure 3, the feature "Fuel_Price" which is even more consistant than the "Temperature" variable ranges between 2.5 to 4.5 among the stores. So that I would believe that there is NO outliers existing in this feature. This figure makes sense to us because Fuel Price has its market Price across the countries, so that there should be not much variation in each geographical area as expected.
Consumer Price Index (CPI): According to U.S. Bureau of Labor Statistics, the meaning of CPI is "a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services". Refer to Figure 4, the CPI value, ranging from 120 to 230, differs quite much among the stores. However, there is NO existing outliers shown in this dataset. This figure relatively indicates us the consuming power across stores. For hypothetical assumption, the area with higher CPI values should expect more consuming power which is supposed to generate more sales amount for the store, or vice versa. Further analysis will be conducted to measure this aspect of feature along with this analysis proceed.
Unemployment: The unemployment rate is another index shows the consuming power in the country. Refer to the Figure 5, the data shows that most of the values fall between 6 to 8 withouth outliers among the stores. Similar to the "CPI" feature, Further analysis will be conducted to measure how this feature influence to the Weekly Sales among the stores.
import pandas as pd
import numpy as np
trainDf = pd.read_csv("./data/train.csv")
feaDf = pd.read_csv("./data/features.csv").drop(["IsHoliday"], axis=1)
storesDf = pd.read_csv("./data/stores.csv")
df1 = pd.merge(trainDf, feaDf, on = ["Store", "Date"])
#print(df1.shape)
train = pd.merge(df1, storesDf, on = "Store")
print("Dataset size:",train.shape)
#Rename column names
train.columns = ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday', 'Temperature',
'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4',
'MarkDown5', 'CPI', 'Unemployment', 'Store_Type', 'Size']
display(train.head(3))
cntLst = []
print("Table 1")
print('''Proportion of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''')
print()
for store in train.Store.unique():
for var in ["MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"]:
cntLst.append(train[var][train.Store == store].isnull().sum())
print("Store #%d"%(store), end="")
print(" with %d instance has missing values"%(train[train.Store == store].shape[0]), end=" ")
print(cntLst, end=" ")
print("in percent", end=" ")
print(np.round(np.array(cntLst) / train[train.Store == store].shape[0], 2)*100, end="\n")
cntLst=[]
cntLst = []
print("Table 2")
print('''Date range WITHOUT Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''', end="\n")
print()
for store in train.Store.unique():
arr = sorted(pd.to_datetime(train.Date[train.MarkDown1.isnull() == False][train.Store == store].unique()))
print("Store %d %s - %s"%(store,min(arr),max(arr)))
cntLst = []
print("Table 3")
print('''Date range of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''', end="\n")
print()
for store in train.Store.unique():
arr = sorted(pd.to_datetime(train.Date[train.MarkDown1.isnull() == True][train.Store == store].unique()))
print("Store %d %s - %s"%(store,min(arr),max(arr)))
print("Table 4")
print('''Split the feature "Type" into A,B,C and "IsHoliday_x" into 1,0 ''')
#display(pd.get_dummies(train[["IsHoliday_x"]]))
pd.concat((pd.get_dummies(train["Store_Type"]),pd.get_dummies(train["IsHoliday"])), axis=1).head(5)
import plotly.graph_objs as go
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.offline as offline
init_notebook_mode(connected=True)
namesLst=list(map(str,train.Store.unique()))
aa=[]
salesLst=[]
for name in namesLst:
aa=round(train.Weekly_Sales[train.Store == int(name)]/1000,2)
salesLst = salesLst + [aa]
traces=[]
for name, sales in zip (namesLst, salesLst):
traces.append(go.Box(
x=sales,
name=name
))
layout = go.Layout(
title = 'Range of Sales Values by each store (Figure 1)',
#yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
autosize=False,
width=700,
height=1000,
yaxis =dict(title = "Store Number",
exponentformat='e',
showexponent='all',
titlefont=dict(size=18),
tick0=5,ticks="outside",
dtick=1,
tickwidth=2,
showgrid=True),
xaxis = dict(title="Weekly Sales (in thousands)",
titlefont=dict(size=18),
zeroline=True, range=[-10,200], showgrid=True),
margin = dict(l=60,r=30, b=80, t=40),
showlegend=False
)
fig = go.Figure(data=traces, layout=layout)
iplot(fig, show_link=True)
namesLst=list(map(str,train.Store.unique()))
aa=[]
salesLst=[]
for name in namesLst:
aa=round(train.Temperature[train.Store == int(name)],2)
salesLst = salesLst + [aa]
traces=[]
for name, sales in zip (namesLst, salesLst):
traces.append(go.Box(
x=sales,
name=name
))
layout = go.Layout(
title = 'Range of Temperature Values by each store (Figure 2)',
#yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
autosize=False,
width=700,
height=1000,
yaxis =dict(title = "Store Number",
exponentformat='e',
showexponent='all',
titlefont=dict(size=18),
tick0=5,ticks="outside",
dtick=1,
tickwidth=2,
showgrid=True),
xaxis = dict(title="Temperature(F)",
titlefont=dict(size=18),
zeroline=True, range=[-5,120], showgrid=True),
margin = dict(l=60,r=30, b=80, t=40),
showlegend=False
)
fig = go.Figure(data=traces, layout=layout)
iplot(fig)
namesLst=list(map(str,train.Store.unique()))
aa=[]
salesLst=[]
for name in namesLst:
aa=round(train.Fuel_Price[train.Store == int(name)],2)
salesLst = salesLst + [aa]
traces=[]
for name, sales in zip (namesLst, salesLst):
traces.append(go.Box(
x=sales,
name=name
))
layout = go.Layout(
title = 'Range of Fuel Price Values by each store (Figure 3)',
#yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
autosize=False,
width=700,
height=1000,
yaxis =dict(title = "Store Number",
exponentformat='e',
showexponent='all',
titlefont=dict(size=18),
tick0=5,ticks="outside",
dtick=1,
tickwidth=2,
showgrid=True),
xaxis = dict(title="Fuel Price($)",
titlefont=dict(size=18),
zeroline=True, range=[2,5], showgrid=True),
margin = dict(l=60,r=30, b=80, t=40),
showlegend=False
)
fig = go.Figure(data=traces, layout=layout)
iplot(fig)
namesLst=list(map(str,train.Store.unique()))
aa=[]
salesLst=[]
for name in namesLst:
aa=round(train.CPI[train.Store == int(name)],2)
salesLst = salesLst + [aa]
traces=[]
for name, sales in zip (namesLst, salesLst):
traces.append(go.Box(
x=sales,
name=name
))
layout = go.Layout(
title = 'Range of CPI Values by each store (Figure 4)',
#yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
autosize=False,
width=700,
height=1000,
yaxis =dict(title = "Store Number",
exponentformat='e',
showexponent='all',
titlefont=dict(size=18),
tick0=5,ticks="outside",
dtick=1,
tickwidth=2,
showgrid=True),
xaxis = dict(title="CPI",
titlefont=dict(size=18),
zeroline=True, range=[100,250], showgrid=True),
margin = dict(l=60,r=30, b=80, t=40),
showlegend=False
)
fig = go.Figure(data=traces, layout=layout)
iplot(fig)
namesLst=list(map(str,train.Store.unique()))
aa=[]
salesLst=[]
for name in namesLst:
aa=round(train.Unemployment[train.Store == int(name)],2)
salesLst = salesLst + [aa]
traces=[]
for name, sales in zip (namesLst, salesLst):
traces.append(go.Box(
x=sales,
name=name
))
layout = go.Layout(
title = 'Range of Unemployment Rate by each store (Figure 5)',
#yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
autosize=False,
width=700,
height=1000,
yaxis =dict(title = "Store Number",
exponentformat='e',
showexponent='all',
titlefont=dict(size=18),
tick0=5,ticks="outside",
dtick=1,
tickwidth=2,
showgrid=True),
xaxis = dict(title="Unemployment Rate",
titlefont=dict(size=18),
zeroline=True, range=[0,20], showgrid=True),
margin = dict(l=60,r=30, b=80, t=40),
showlegend=False
)
fig = go.Figure(data=traces, layout=layout)
iplot(fig)